In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like to analyze the data, rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import re  # used below to re-parse pasted K-means scores
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error, silhouette_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler  # Imputer was removed from sklearn.preprocessing; SimpleImputer (below) replaces it
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import operator
from sklearn.cluster import KMeans
import gc
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
# magic word for producing visualizations in notebook
%matplotlib inline
There are four data files associated with this project:
- Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.
The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.
Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.
In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the .csv data files in this project that they're semicolon (;) delimited, so an additional argument in the read_csv() call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.
You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
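As a minimal illustration of the semicolon-delimited format described above (with made-up column values; the real files are loaded in the next cell), `sep=';'` is required and `low_memory=False` avoids the mixed-dtype warning by reading each column in a single pass:

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for the semicolon-delimited Arvato files (hypothetical values)
raw = "LNR;AGER_TYP;ALTER_HH\n1;2;10\n2;-1;0\n"

df = pd.read_csv(StringIO(raw), sep=';', low_memory=False)
print(df.shape)  # (2, 3)
```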
We start by loading the general population and the customers data
# gen_pop = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')
# customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')
gen_pop = pd.read_csv('gen_pop.csv')
customers = pd.read_csv('customers.csv')
One of the first steps we carry out is to look at the top 5 rows of both datasets. At a quick glance, they appear to have almost the same features, but a significantly different number of rows. We will look at how the values of these features differ.
if 'Unnamed: 0' in gen_pop.columns:
    gen_pop.drop(['Unnamed: 0'], axis=1, inplace=True)
if 'Unnamed: 0' in customers.columns:
    customers.drop(['Unnamed: 0'], axis=1, inplace=True)
gen_pop.head()
customers.head()
Here we investigate the size of the two datasets. The general population dataset is significantly larger, at 2.7 GB for its 891221 rows, as opposed to 613.7 MB for the 191652 rows of the customers dataset. We will tackle this size issue by casting the columns to appropriate types in the preprocessing section.
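One common way to shrink dataframes of this size is to downcast 64-bit numeric columns to the smallest dtype that can hold their values. A sketch on synthetic data; the helper name `downcast_numeric` is ours, not part of the project:

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Downcast int64/float64 columns to the smallest safe numeric dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=['int64']).columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes(include=['float64']).columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out

demo = pd.DataFrame({'a': np.arange(1000, dtype='int64'),
                     'b': np.linspace(0, 1, 1000)})
small = downcast_numeric(demo)
# memory drops because int64 -> int16 and float64 -> float32 here
print(demo.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```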
Another point that is addressed below is the features present in the customer dataset but not in the general population. These are: 'ONLINE_PURCHASE', 'CUSTOMER_GROUP', 'PRODUCT_GROUP'
gen_pop.info(memory_usage='deep')
print('\n')
customers.info(memory_usage='deep')
print('\n number of columns in general population:', len(gen_pop.columns))
print('\n number of columns in customers:', len(customers.columns), '\n')
print(set(customers.columns)-set(gen_pop.columns))
In the two cells below we look at summary statistics for each feature, such as the mean, min, and max values. Given the large number of features, the differences between the two datasets are not easily discerned in table form, so we will visualize them instead.
customers.describe()
gen_pop.describe()
The two cells below highlight the largest columns in terms of memory. These are object (string) or categorical columns.
(gen_pop.memory_usage(deep=True) * 1e-6).sort_values(ascending= False)[:10]
(customers.memory_usage(deep=True) * 1e-6).sort_values(ascending= False)[:10]
cols = gen_pop.columns
num_cols_gen = gen_pop._get_numeric_data().columns
non_numeric_col_gen_pop = list(set(cols) - set(num_cols_gen))
non_numeric_col_gen_pop
cols = customers.columns
num_cols = customers._get_numeric_data().columns
non_numeric_col_customers = list(set(cols) - set(num_cols))
non_numeric_col_customers
Exploring the correlation between the different features
corr = gen_pop[num_cols_gen].corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
square=True, cbar_kws={"shrink": .5})
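Beyond the heatmap, it can help to list the feature pairs whose absolute correlation exceeds a threshold, since those are candidates for dropping or for the regression-based imputation used later. A sketch on synthetic data (column names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=['f1', 'f2', 'f3'])
df['f1_copy'] = df['f1'] * 2 + rng.normal(scale=0.01, size=500)  # near-duplicate feature

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```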
We start by loading the training and testing data
# df_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
# df_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
df_train = pd.read_csv('df_train.csv')
df_test = pd.read_csv('df_test.csv')
Dropping the Unnamed: 0 column
if 'Unnamed: 0' in df_train.columns:
    df_train.drop(['Unnamed: 0'], axis=1, inplace=True)
if 'Unnamed: 0' in df_test.columns:
    df_test.drop(['Unnamed: 0'], axis=1, inplace=True)
Checking the shape of both datasets
print(df_train.shape)
print(df_test.shape)
Truncating df_train to the same number of rows as df_test, in order to simplify the comparison plots below
df_train = df_train.iloc[:42833,:]
The next two cells summarize both datasets (mean, min, max, etc.). The differences will be highlighted in the upcoming plots.
df_train.describe()
df_test.describe()
Below we plot the response distribution within the training data. We can see that the data is heavily imbalanced with respect to the response.
sns.countplot(df_train['RESPONSE'])
plt.savefig('skewed_response.png')
print("There are {}% response values with 1".format(100 * df_train["RESPONSE"].value_counts()[1]/df_train.shape[0]))
# adapted from https://www.kaggle.com/gpreda/santander-eda-and-prediction
def plot_feature_distribution(df1, df2, label1, label2, features, col=10):
    """ Plots the density of each feature for two dataframes on a shared grid
    Inputs:
        df1, df2: the dataframes to compare
        label1, label2: legend labels for the two distributions
        features: the features to be compared
        col: number of subplot rows (10 plots per row)
    """
    i = 0
    sns.set_style('whitegrid')
    fig, ax = plt.subplots(col, 10, figsize=(18, 22))
    for feature in features:
        i += 1
        plt.subplot(col, 10, i)
        try:
            sns.distplot(df1[feature], hist=False, label=label1)
        except Exception:
            pass
        try:
            sns.distplot(df2[feature], hist=False, label=label2)
        except Exception:
            pass
        plt.xlabel(feature, fontsize=9)
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.savefig('response_0_vs_1_in_train.png')  # note: overwritten on every call
    plt.show();
Here we compare the feature distributions in the training data between individuals who responded and those who did not
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[:100]
plot_feature_distribution(t0, t1, '0', '1', features)
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[100:200]
plot_feature_distribution(t0, t1, '0', '1', features)
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[200:300]
plot_feature_distribution(t0, t1, '0', '1', features)
t0 = df_train.loc[df_train['RESPONSE'] == 0]
t1 = df_train.loc[df_train['RESPONSE'] == 1]
features = num_cols[300:360]
plot_feature_distribution(t0, t1, '0', '1', features, col=6)
We can already see that for some features both classes have similar distributions, whereas for others there is a clear difference. These will be monitored closely in the modeling section.
Next we compare the distributions between the training and testing data. From the plots below, there is no noticeable distribution shift between the two.
features = num_cols[:100]
plot_feature_distribution(df_train, df_test, 'train', 'test', features)
features = num_cols[100:200]
plot_feature_distribution(df_train, df_test, 'train', 'test', features)
features = num_cols[200:300]
plot_feature_distribution(df_train, df_test, 'train', 'test', features)
First we check for duplicates
gen_pop.duplicated().sum()
We will use the functions below to replace each feature's specific "unknown" codes with NaN, so that the true NaN count in the data is captured.
def replaceMissingWithNaN(col, missing):
    """ Returns the number of missing values in a column and a list representing the column with missing-value codes converted to NaN
    Input:
        col: a Series that may contain NaN
        missing: values that are considered unknown/missing for this feature
    Output:
        count: number of missing values
        lst: the column values with missing codes now correctly represented as NaN
    """
    missing = list(missing) + ["XX", "X"]  # copy so the caller's list is not mutated
    count = col.isna().sum()
    lst = []
    for item in col:
        if item in missing:
            count += 1
            lst.append(np.NaN)
        else:
            lst.append(item)
    return count, lst
def findNans(df, missing):
    """ A function that uses replaceMissingWithNaN to produce a DataFrame with all missing-value codes converted to NaN
    Input:
        df: dataframe to find NaN for
        missing: dataframe mapping each attribute to its missing-value codes
    Output:
        count_nan: number of NaN per feature
        new_df: the df with all missing values now represented as NaN
    """
    headers = df.columns.values
    counts = []
    new_df = pd.DataFrame()
    for i in range(len(headers)):
        feat = df[headers[i]]
        try:
            miss = list(missing.loc[headers[i], :].values)  # use the parameter, not the global
        except KeyError:
            miss = []
        count, new_df[headers[i]] = replaceMissingWithNaN(feat, miss)
        counts.append([headers[i], count])
    count_nan = pd.DataFrame(counts, columns=['feature', 'NaN_count'])
    return count_nan, new_df
def findNansPrep(attributes_values):
    """ A function that prepares a df mapping each attribute to its NaN codes
    Input: attributes_values: a missing-value glossary - dataframe
    Output: dataframe with each attribute's missing-value codes
    """
    missing_values = attributes_values[~attributes_values.Attribute.isnull()].copy()  # copy to avoid SettingWithCopyWarning
    missing_values.drop(columns=[missing_values.columns[0], 'Description'], inplace=True)
    missing_values.Meaning.replace({'unknown / no main age detectable': 'unknown'}, inplace=True)
    missing_values = missing_values[missing_values.Meaning == 'unknown']
    missing_values.drop(['Meaning'], axis=1, inplace=True)
    missing_values.set_index('Attribute', inplace=True)
    return missing_values
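The row-by-row loop in `replaceMissingWithNaN` is correct but slow on ~891k rows; the same replacement can be done column-wise with pandas. A minimal sketch with a hypothetical missing-code list:

```python
import numpy as np
import pandas as pd

col = pd.Series([1, -1, 0, 3, 'X'])
missing = [-1, 0, 'X', 'XX']           # hypothetical unknown codes for this feature

cleaned = col.mask(col.isin(missing))  # positions matching an unknown code become NaN
count = int(cleaned.isna().sum())
print(count)  # 3
```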
We load the Excel file that specifies which values correspond to NaN for each feature
# read xsl file and get values for NaN for each feature
attributes_values = pd.read_excel('DIAS Attributes - Values 2017.xlsx', header= 1)
missing_values = findNansPrep(attributes_values)
Counting missing data per feature using the functions defined above
nan_count, gen_pop_with_nan = findNans(gen_pop, missing_values)
Here we calculate NaN as a percentage for each feature
nan_count['NaN_perc'] = nan_count.NaN_count.apply(lambda x: x/gen_pop_with_nan.shape[0])
We set a limit beyond which the feature is a candidate to be dropped
fig, ax = plt.subplots(figsize=[10, 5])
plt.hist(nan_count['NaN_perc'], bins=20);
plt.xlabel('percentage of missing data', fontsize=18)
plt.ylabel('frequency', fontsize=16)
fig.savefig('missing_data_feat_hist.png')
feature_NaN_limit = 0.4
nan_count[nan_count.NaN_perc> feature_NaN_limit]
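For reference, once the unknown codes have been converted to NaN, the per-feature NaN fraction and the candidate-drop list can be computed in one vectorized step. A toy sketch of the thresholding logic with made-up columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan, np.nan],
                   'b': [1, 2, 3, 4, np.nan]})
nan_perc = df.isnull().mean()          # fraction of NaN per column
to_drop = nan_perc[nan_perc > 0.4].index.tolist()
print(to_drop)  # ['a']
```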
- AGER_TYP: high correlation with FINANZ_SPARER, but linear regression shows a bad fit, DROP
- ALTER_HH: no good correlation, but a relatively low NaN percentage, KEEP
- ALTER_KIND: good correlation only with similar features that are themselves mostly missing, DROP
- EXTSEL992: decent correlation with VK_DISTANZ, but linear regression shows a bad fit, DROP
- KK_KUNDENTYP: good correlation with D19_GESAMT_DATUM, KEEP
feats = nan_count[nan_count.NaN_perc > feature_NaN_limit].feature
for feat in feats:
    print("\n", feat, "\n", corr.loc[feat, :].nlargest())
    print("\n", corr.loc[feat, :].nsmallest())
The cell below shows that the ALTER_KIND columns cannot be used to impute one another's missing values: only a small fraction of rows have any of them populated.
gen_pop_with_nan[~gen_pop_with_nan.ALTER_KIND1.isnull() | ~gen_pop_with_nan.ALTER_KIND2.isnull() | ~gen_pop_with_nan.ALTER_KIND3.isnull() | ~gen_pop_with_nan.ALTER_KIND4.isnull()].shape[0]/gen_pop_with_nan.shape[0]
def linearRegFillNa(df, feature, util_feature):
    """ A function that uses linear regression to fill NaN
    Input:
        df: dataframe to be utilized
        feature: the feature for which linear regression is applied
        util_feature: the feature used to extrapolate the wanted feature
    Output:
        predicted_feature: an array representing the feature after extrapolation
    """
    shared_data_util_feature = df[~df[util_feature].isnull() & ~df[feature].isnull()][util_feature]
    shared_data_util_feature = shared_data_util_feature.values.reshape(-1, 1)
    shared_data_feature = df[~df[util_feature].isnull() & ~df[feature].isnull()][feature]
    shared_data_feature = shared_data_feature.values.reshape(-1, 1)
    reg = LinearRegression().fit(shared_data_util_feature, shared_data_feature)
    available_util_feature = df[~df[util_feature].isnull()][util_feature]
    available_util_feature = available_util_feature.values.reshape(-1, 1)
    predicted_feature = reg.predict(available_util_feature)
    predicted_feature = np.round(predicted_feature)
    return predicted_feature
def regFillApply(df, feature, util_feature):
    """ A function that uses linearRegFillNa to add a new extrapolated feature to df
    Input:
        df: dataframe to be utilized
        feature: the feature for which linear regression is applied
        util_feature: the feature used to extrapolate the wanted feature
    Output:
        None - it modifies df in place
    """
    df[feature + "_corr"] = np.nan
    df.loc[~df[util_feature].isnull(), feature + "_corr"] = linearRegFillNa(df, feature, util_feature)
    shared = ~df[util_feature].isnull() & ~df[feature].isnull()
    print("r2=", r2_score(df[shared][feature], df[shared][feature + "_corr"]))
    print("MSE=", mean_squared_error(df[shared][feature], df[shared][feature + "_corr"]))
    df[feature + "_mod"] = df[feature]
    df[feature + "_mod"].fillna(df[feature + "_corr"], inplace=True)
regFillApply(gen_pop_with_nan, 'AGER_TYP', 'FINANZ_SPARER')
regFillApply(gen_pop_with_nan, 'EXTSEL992', 'VK_DISTANZ')
regFillApply(gen_pop_with_nan, 'KK_KUNDENTYP', 'D19_GESAMT_DATUM')
gen_pop_with_nan.columns[-6:].values
f, ax = plt.subplots(figsize=(11, 9))
plt.xlim(0, 7)
plt.ylim(0, 7)
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
gen_pop_with_nan.plot.hexbin(x='KK_KUNDENTYP_corr', y='KK_KUNDENTYP', gridsize=10, ax=ax)
f, ax = plt.subplots(figsize=(11, 9))
plt.xlim(0, 7)
plt.ylim(0, 7)
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
gen_pop_with_nan.plot.hexbin(x='KK_KUNDENTYP_mod', y='KK_KUNDENTYP', gridsize=10, ax=ax)
gen_pop_with_nan.drop(['AGER_TYP', 'AGER_TYP_corr','AGER_TYP_mod', 'EXTSEL992', 'EXTSEL992_corr', 'EXTSEL992_mod', 'KK_KUNDENTYP', 'KK_KUNDENTYP_corr'], axis=1, inplace=True)
Missing data per row
# percentage of NaNs per row
gen_pop_with_nan['row_percent_nan']= gen_pop_with_nan.isnull().mean(axis=1)
plt.hist(gen_pop_with_nan['row_percent_nan'])
row_NaN_limit = 0.3
gen_pop_low_nan = gen_pop_with_nan[gen_pop_with_nan['row_percent_nan'] < row_NaN_limit].copy()  # copy to avoid SettingWithCopyWarning on the drops below
gen_pop_with_nan.shape[0]-gen_pop_low_nan.shape[0]
gen_pop_low_nan.drop(['row_percent_nan'], axis=1, inplace= True)
gen_pop_low_nan[non_numeric_col_gen_pop].head()
gen_pop_low_nan['OST_WEST_KZ'].value_counts()
gen_pop_low_nan.OST_WEST_KZ = gen_pop_low_nan.OST_WEST_KZ.replace({"W": 0, "O": 1})
gen_pop_low_nan['EINGEFUEGT_AM'] = pd.to_datetime(gen_pop_low_nan['EINGEFUEGT_AM'])
gen_pop_low_nan['EINGEFUEGT_AM_weekday'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.weekday())
gen_pop_low_nan['EINGEFUEGT_AM_day'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.day)
gen_pop_low_nan['EINGEFUEGT_AM_month'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.month)
gen_pop_low_nan['EINGEFUEGT_AM_year'] = gen_pop_low_nan['EINGEFUEGT_AM'].apply(lambda x: x.year)
gen_pop_low_nan.drop(['EINGEFUEGT_AM'], axis= 1, inplace= True)
gen_pop_low_nan['CAMEO_INTL_2015_adj'] = gen_pop_low_nan['CAMEO_INTL_2015'].apply(lambda x: int(x) if x == x else np.nan)
gen_pop_low_nan['CAMEO_INTL_2015_adj'].value_counts().shape
gen_pop_low_nan.drop(['CAMEO_INTL_2015'], axis=1, inplace= True)
def getWealth(col):
    """ Derives a wealth bucket from the two-digit CAMEO_INTL_2015 code (driven by the tens digit) """
    lst = []
    for item in col:
        x = float(item)
        if x > 50:
            lst.append(0)
        elif x > 35:
            lst.append(1)
        elif x > 25:
            lst.append(2)
        elif x > 15:
            lst.append(3)
        elif x > -1:
            lst.append(4)
        else:
            lst.append(np.nan)
    return lst
def getLifeStage(col):
    """ Derives a life-stage bucket from the units digit of the CAMEO_INTL_2015 code """
    lst = []
    for item in col:
        if str(item)[1] == '1':
            lst.append(0)
        elif str(item)[1] == '2':
            lst.append(1)
        elif str(item)[1] == '3':
            lst.append(2)
        elif str(item)[1] == '4':
            lst.append(3)
        elif str(item)[1] == '5':
            lst.append(4)
        else:
            lst.append(np.nan)
    return lst
gen_pop_low_nan["wealth"] = getWealth(gen_pop_low_nan["CAMEO_INTL_2015_adj"])
gen_pop_low_nan["life_stage"] = getLifeStage(gen_pop_low_nan['CAMEO_INTL_2015_adj'])
gen_pop_low_nan.drop(['CAMEO_INTL_2015_adj'], axis= 1, inplace= True)
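Since CAMEO_INTL_2015 is a two-digit code, the split behind getWealth/getLifeStage can also be done arithmetically: the tens digit carries wealth, the units digit carries life stage. The bucket labels above are the author's own mapping, so this sketch only shows the digit extraction (with made-up code values):

```python
import numpy as np
import pandas as pd

codes = pd.Series([51.0, 22.0, np.nan, 13.0])  # CAMEO_INTL_2015-style two-digit codes

wealth_digit = codes // 10   # tens digit: wealth tier
life_stage = codes % 10      # units digit: life stage
print(wealth_digit.tolist(), life_stage.tolist())
```

NaN propagates through both operations, so unknown codes stay unknown without any special-casing.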
gen_pop_low_nan['CAMEO_DEUG_2015_adj'] = gen_pop_low_nan['CAMEO_DEUG_2015'].apply(lambda x: int(x) if x == x else np.nan)
gen_pop_low_nan['CAMEO_DEUG_2015_adj'].value_counts()
gen_pop_low_nan.drop(['CAMEO_DEUG_2015'], axis= 1, inplace= True)
gen_pop_low_nan['D19_LETZTER_KAUF_BRANCHE'].value_counts()
gen_pop_low_nan['CAMEO_DEU_2015'].value_counts()
gen_pop_low_nan.shape
gc.collect()
gen_pop_low_nan_encoded = pd.get_dummies(gen_pop_low_nan, columns = ['D19_LETZTER_KAUF_BRANCHE','CAMEO_DEU_2015'])
# gen_pop_low_nan_encoded.to_csv('gen_pop_low_nan_encoded.csv')
# gen_pop_low_nan_encoded = pd.read_csv('gen_pop_low_nan_encoded.csv')
More mixed-feature engineering
- wealth — 0 poor: 1,2,5,6,14,15,21,24,29,31; 1 mid: 3,4,7,8,16,22,25,30,32; 2 rich: 10,13,18,20,23,28,35,39,40
- status — 0 single: 1-13 and 21-23; 1 couple: 14-20; 2 family: 24-40
- age — 0: 1,3,14,18,29,30,33,34,35; 1: 2,4,39; 2: 5,7,11,37; 3: 6,8,12,13,15,16,19,20,31,32,36,38,40
def wealth_soc_stat_age(col):
    """ A function that calculates wealth, social status and age features from LP_LEBENSPHASE_FEIN
    Input:
        col: the original LP_LEBENSPHASE_FEIN col
    Output:
        wealth: wealth feature - list
        stat: social status feature - list
        age: age feature - list
    """
    wealth = []
    stat = []
    age = []
    for item in col:
        if item in [1, 2, 5, 6, 14, 15, 21, 24, 29, 31]:
            wealth.append(0)
        elif item in [3, 4, 7, 8, 16, 22, 25, 30, 32]:
            wealth.append(1)
        elif item in [10, 13, 18, 20, 23, 28, 35, 39, 40]:
            wealth.append(2)
        else:
            wealth.append(np.nan)
    for item in col:
        if item <= 13 or 21 <= item <= 23:
            stat.append(0)
        elif 14 <= item <= 20:
            stat.append(1)
        elif item >= 24:
            stat.append(2)
        else:
            stat.append(np.nan)
    for item in col:
        if item in [1, 3, 14, 18, 29, 30, 33, 34, 35]:
            age.append(0)
        elif item in [2, 4, 39]:
            age.append(1)
        elif item in [5, 7, 11, 37]:
            age.append(2)
        elif item in [6, 8, 12, 13, 15, 16, 19, 20, 31, 32, 36, 38, 40]:
            age.append(3)
        else:
            age.append(np.nan)
    return wealth, stat, age
def wealth_soc_stat_LP_LEBENSPHASE_GROB(col):
    """ A function that calculates wealth and social status features from LP_LEBENSPHASE_GROB
    Input:
        col: the original LP_LEBENSPHASE_GROB col
    Output:
        wealth: wealth feature - list
        stat: social status feature - list
    """
    wealth = []
    stat = []
    for item in col:
        if item in [1, 2, 4, 7, 10]:
            wealth.append(0)
        elif item in [9]:
            wealth.append(1)
        elif item in [3, 5, 8, 11, 12]:
            wealth.append(2)
        else:
            wealth.append(np.nan)
    for item in col:
        if item <= 7:
            stat.append(0)
        elif item > 7:
            stat.append(1)
        else:
            stat.append(np.nan)
    return wealth, stat
gen_pop_low_nan_encoded['LP_LEBENSPHASE_FEIN_wealth'], gen_pop_low_nan_encoded['LP_LEBENSPHASE_FEIN_stat'], gen_pop_low_nan_encoded['LP_LEBENSPHASE_FEIN_age'] = wealth_soc_stat_age(gen_pop_low_nan_encoded.LP_LEBENSPHASE_FEIN)
gen_pop_low_nan_encoded['LP_LEBENSPHASE_GROB_wealth'], gen_pop_low_nan_encoded['LP_LEBENSPHASE_GROB_stat'] = wealth_soc_stat_LP_LEBENSPHASE_GROB(gen_pop_low_nan_encoded.LP_LEBENSPHASE_GROB)
def getDecade(col):
    """ A function that calculates the decade of a person's youth movement from PRAEGENDE_JUGENDJAHRE
    Input:
        col: PRAEGENDE_JUGENDJAHRE
    Output:
        lst: the decade of a person's youth movement - list
    """
    lst = []
    for item in col:
        if item == 1 or item == 2:
            lst.append(1)
        elif item == 3 or item == 4:
            lst.append(2)
        elif item > 4 and item < 8:
            lst.append(3)
        elif item == 8 or item == 9:
            lst.append(4)
        elif item > 9 and item < 14:
            lst.append(5)
        elif item == 14 or item == 15:
            lst.append(6)
        else:
            lst.append(np.nan)
    return lst
def getMovement(col):
    """ A function that calculates the movement of interest during a person's youth from PRAEGENDE_JUGENDJAHRE
    Input:
        col: PRAEGENDE_JUGENDJAHRE
    Output:
        lst: the movement of interest during a person's youth - list
    """
    lst = []
    for item in col:
        if item in [1, 3, 5, 8, 10, 12, 14]:
            lst.append(0)
        elif item in [2, 4, 6, 7, 9, 11, 13, 15]:
            lst.append(1)
        else:
            lst.append(np.nan)
    return lst
gen_pop_low_nan_encoded['Movement'] = getMovement(gen_pop_low_nan_encoded['PRAEGENDE_JUGENDJAHRE'])
gen_pop_low_nan_encoded['Decade']= getDecade(gen_pop_low_nan_encoded['PRAEGENDE_JUGENDJAHRE'])
gen_pop_low_nan_encoded['WOHNLAGE_adj'] = gen_pop_low_nan_encoded.WOHNLAGE.apply(lambda x: x if x < 7 else np.nan)
gen_pop_low_nan_encoded.drop(['WOHNLAGE'], axis= 1, inplace= True)
gen_pop_low_nan_encoded['PLZ8_BAUMAX_adj'] = gen_pop_low_nan_encoded.PLZ8_BAUMAX.apply(lambda x: x if x < 5 else np.nan)
gen_pop_low_nan_encoded.drop(['PLZ8_BAUMAX'], axis= 1, inplace= True)
gen_pop_low_nan_encoded.drop(["ALTER_KIND1","ALTER_KIND2","ALTER_KIND3","ALTER_KIND4"], axis= 1, inplace= True)
gen_pop_low_nan_encoded.shape
gc.collect()
# gen_pop_low_nan_encoded.to_csv('gen_pop_low_nan_encoded_mixed_resolved_pending_eng.csv')
del gen_pop, gen_pop_low_nan, gen_pop_with_nan
gc.collect()
idx = gen_pop_low_nan_encoded.columns.values[1:]
for df in [gen_pop_low_nan_encoded]:
    df['sum'] = df[idx].sum(axis=1)
    df['min'] = df[idx].min(axis=1)
    df['max'] = df[idx].max(axis=1)
    df['mean'] = df[idx].mean(axis=1)
    df['std'] = df[idx].std(axis=1)
    df['skew'] = df[idx].skew(axis=1)
    df['kurt'] = df[idx].kurtosis(axis=1)
    df['med'] = df[idx].median(axis=1)
# gen_pop_low_nan_encoded.to_pickle('gen_pop_low_nan_encoded_few_eng_feat.pkl')
gen_pop_low_nan_encoded = pd.read_pickle('gen_pop_low_nan_encoded_few_eng_feat.pkl')
gen_pop_low_nan_encoded.drop(columns=['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'PRAEGENDE_JUGENDJAHRE'], axis= 1, inplace= True)
gen_pop_low_nan_encoded.head()
Now we run imputation, scaling, and PCA to prepare for the segmentation process
imp = SimpleImputer(missing_values= np.nan, strategy='mean')
gen_pop_no_nan = imp.fit_transform(gen_pop_low_nan_encoded)
scaler = RobustScaler()
gen_pop_no_nan_scaled = scaler.fit_transform(gen_pop_no_nan)
pca = PCA()
X_pca = pca.fit_transform(gen_pop_no_nan_scaled)
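The `Pipeline` import at the top goes unused so far; the three steps above can be bundled into one object, which makes it harder to forget that the statistics fitted on the general population must be reused (via `transform`, not `fit_transform`) on the customers. A sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

prep = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', RobustScaler()),
    ('pca', PCA(n_components=2, random_state=42)),
])

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
X[::7, 2] = np.nan                     # inject some missing values
X_2d = prep.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```

With the real data one would `prep.fit_transform` on the AZDIAS dataframe and then call `prep.transform` on the cleaned customers dataframe.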
The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.
n = len(pca.explained_variance_ratio_)
vr = pca.explained_variance_ratio_
index = np.arange(1, n + 1)
cum_vr = np.cumsum(vr)
plt.figure()
ax = plt.subplot()
ax.plot(index, cum_vr)
ax.set_xlabel("Principal Component")
ax.set_ylabel("Explained Variance")
plt.savefig('pca_vr.png')
# Re-apply PCA to the data while selecting for number of components to retain.
pca = PCA(n_components=100, random_state=42)
X_pca = pca.fit_transform(gen_pop_no_nan_scaled)
pca.explained_variance_ratio_.sum()
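Rather than hard-coding 100 components, scikit-learn can pick the smallest number of components that reaches a target fraction of explained variance by passing a float in (0, 1) as `n_components`. A sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 20))

# A float in (0, 1) tells PCA to keep just enough components
# to explain that fraction of the total variance
pca = PCA(n_components=0.9, random_state=42)
X_r = pca.fit_transform(X)
print(X_r.shape[1], pca.explained_variance_ratio_.sum())
```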
def getWeight(pca, component, df, use_abs):
    """ A function that calculates the weights of features for a given PCA component
    Input:
        pca: fitted PCA of interest
        component: the (1-based) number of the component to investigate - int
        df: the dataframe the PCA was fitted on (for column names)
        use_abs: whether to use absolute weights - bool (renamed from `abs`, which shadowed the builtin)
    Output:
        sort_weights: list of (feature, weight) tuples sorted by weight
    """
    weights = {}
    index_for_weights = df.columns
    for i in range(pca.components_.shape[1]):
        if use_abs:
            weights[index_for_weights[i]] = abs(pca.components_[component-1][i])
        else:
            weights[index_for_weights[i]] = pca.components_[component-1][i]
    sort_weights = sorted(weights.items(), key=operator.itemgetter(1))
    return sort_weights
getWeight(pca, 1, gen_pop_low_nan_encoded, False)
Interpretation of the first component: status derived from car ownership
getWeight(pca, 2, gen_pop_low_nan_encoded, False)
Interpretation of the second component: level of recent activity
getWeight(pca, 3, gen_pop_low_nan_encoded, False)
Interpretation of the third component: offline activity
score = []
for i in range(2, 50):
    kmean_init = KMeans(n_clusters=i, random_state=42)
    model = kmean_init.fit(X_pca)
    score.append(abs(model.score(X_pca)))
    print("\nDone with i:", i, "The score is:", score[i-2])
plt.plot(list(range(2,50)), score)
plt.xlabel('K')
plt.ylabel('SSE')
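`silhouette_score` is imported at the top but never used; it offers a complementary check to the elbow plot, since it peaks rather than decreasing monotonically with K. Because the full silhouette is O(n²), scoring a random subsample is the usual compromise. A sketch on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    # Score a random subsample; the full silhouette is quadratic in n
    scores[k] = silhouette_score(X, labels, sample_size=500, random_state=42)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```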
# K-means scores from an earlier run, pasted as text (the loop above was interrupted) and re-parsed with re below
t = """Done with i: 2 The score is: 322575926.1651047
Done with i: 3 The score is: 313364868.9455074
Done with i: 4 The score is: 294083124.38877946
Done with i: 5 The score is: 289608968.5943805
Done with i: 6 The score is: 285799198.03568727
Done with i: 7 The score is: 282779062.51599026
Done with i: 8 The score is: 280271214.7924869
Done with i: 9 The score is: 278013506.2584997
Done with i: 10 The score is: 276143756.39900583
Done with i: 11 The score is: 274601071.3823531
Done with i: 12 The score is: 273162607.5241778
Done with i: 13 The score is: 271971142.77329594
Done with i: 14 The score is: 270958796.31468433
Done with i: 15 The score is: 270032405.01610416
Done with i: 16 The score is: 269158948.1825118
Done with i: 17 The score is: 267972887.09040833
Done with i: 18 The score is: 267136341.61606655
Done with i: 19 The score is: 266698935.31224236
Done with i: 20 The score is: 266108016.88404578
Done with i: 21 The score is: 265039423.27700454
Done with i: 22 The score is: 264688265.60141635
Done with i: 23 The score is: 264051640.00621733
Done with i: 24 The score is: 262881819.14802387
Done with i: 25 The score is: 262337590.6471799
Done with i: 26 The score is: 262184419.8864755
Done with i: 27 The score is: 261115804.98523906
Done with i: 28 The score is: 260998236.72914648
Done with i: 29 The score is: 260057365.04721287
Done with i: 30 The score is: 259490030.89395157
Done with i: 31 The score is: 258799883.74145913
Done with i: 32 The score is: 258565547.29967064
Done with i: 33 The score is: 258335945.05115688"""
score = re.findall(r"\d+\.\d+", t)
plt.plot(list(range(2,34)), [float(val) for val in score])
plt.xlabel('K')
plt.ylabel('SSE')
k = 10
kmeans = KMeans(n_clusters= k, random_state=42)
final_model = kmeans.fit(X_pca)
gen_population_predict = final_model.predict(X_pca)
def clean_data(df, no_del=True):
    """
    Perform feature trimming, re-encoding, and engineering for demographics data
    INPUT: Demographics DataFrame; no_del: if True, rows with more than 30% missing values are dropped
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    # get rid of the Unnamed: 0 column
    if 'Unnamed: 0' in df.columns:
        df.drop(['Unnamed: 0'], axis=1, inplace=True)
    # convert missing value codes into NaNs
    print("initial", df.shape)
    _, df = findNans(df, missing_values)
    print("after findNans", df.shape)
    # extrapolate high-NaN but highly correlated features
    regFillApply(df, 'KK_KUNDENTYP', 'D19_GESAMT_DATUM')
    # remove selected columns and rows
    to_be_removed = ['AGER_TYP', 'ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4',
                     'EXTSEL992', 'KK_KUNDENTYP', 'KK_KUNDENTYP_corr']
    df.drop(to_be_removed, axis=1, inplace=True)
    if no_del:
        df['percent_NaN'] = df.isnull().mean(axis=1)
        df = df.loc[df['percent_NaN'] <= 0.3, :]
        print("after removing some rows", df.shape)
        df.drop(['percent_NaN'], axis=1, inplace=True)
        print("after removing percent_NaN", df.shape)
    # select, re-encode, and engineer column values
    try:
        df.OST_WEST_KZ.replace({"W": 0, "O": 1}, inplace=True)
    except Exception:
        pass
    print("after OST_WEST_KZ", df.shape)
    df['EINGEFUEGT_AM'] = pd.to_datetime(df['EINGEFUEGT_AM'])
    df['EINGEFUEGT_AM_weekday'] = df['EINGEFUEGT_AM'].apply(lambda x: x.weekday())
    df['EINGEFUEGT_AM_day'] = df['EINGEFUEGT_AM'].apply(lambda x: x.day)
    df['EINGEFUEGT_AM_month'] = df['EINGEFUEGT_AM'].apply(lambda x: x.month)
    df['EINGEFUEGT_AM_year'] = df['EINGEFUEGT_AM'].apply(lambda x: x.year)
    print("after engineering EINGEFUEGT features")
    df['CAMEO_INTL_2015_adj'] = df['CAMEO_INTL_2015'].apply(lambda x: int(x) if x == x else np.nan)
    if 'CAMEO_INTL_2015_adj' in df.columns:
        # use the adjusted (numeric) column, as in the general-population path above
        df["wealth"] = getWealth(df["CAMEO_INTL_2015_adj"])
        df["life_stage"] = getLifeStage(df['CAMEO_INTL_2015_adj'])
        print("after engineering wealth, life_stage")
    df['CAMEO_DEUG_2015_adj'] = df['CAMEO_DEUG_2015'].apply(lambda x: int(x) if x == x else np.nan)
    if 'PRAEGENDE_JUGENDJAHRE' in df.columns:
        df['Movement'] = getMovement(df['PRAEGENDE_JUGENDJAHRE'])
        df['Decade'] = getDecade(df['PRAEGENDE_JUGENDJAHRE'])
        print("after engineering Movement and Decade")
    df['LP_LEBENSPHASE_FEIN_wealth'], df['LP_LEBENSPHASE_FEIN_stat'], df['LP_LEBENSPHASE_FEIN_age'] = wealth_soc_stat_age(df.LP_LEBENSPHASE_FEIN)
    df['LP_LEBENSPHASE_GROB_wealth'], df['LP_LEBENSPHASE_GROB_stat'] = wealth_soc_stat_LP_LEBENSPHASE_GROB(df.LP_LEBENSPHASE_GROB)
    print("after engineering FEIN_wealth,stat,age and GROB_wealth,stat")
    df['WOHNLAGE_adj'] = df.WOHNLAGE.apply(lambda x: x if x < 7 else np.nan)
    df['PLZ8_BAUMAX_adj'] = df.PLZ8_BAUMAX.apply(lambda x: x if x < 5 else np.nan)
    df.drop(['EINGEFUEGT_AM', 'CAMEO_INTL_2015', 'CAMEO_INTL_2015_adj', 'CAMEO_DEUG_2015', 'PRAEGENDE_JUGENDJAHRE',
             'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'WOHNLAGE', 'PLZ8_BAUMAX'], axis=1, inplace=True)
    print("before one-hot encoding", df.shape)
    df = pd.get_dummies(df, columns=['D19_LETZTER_KAUF_BRANCHE', 'CAMEO_DEU_2015'])
    print("after one-hot encoding", df.shape)
    # row-wise aggregate features
    idx = df.columns.values[1:]
    df['sum'] = df[idx].sum(axis=1)
    df['min'] = df[idx].min(axis=1)
    df['max'] = df[idx].max(axis=1)
    df['mean'] = df[idx].mean(axis=1)
    df['std'] = df[idx].std(axis=1)
    df['skew'] = df[idx].skew(axis=1)
    df['kurt'] = df[idx].kurtosis(axis=1)
    df['med'] = df[idx].median(axis=1)
    print("after engineering sum, min, max, etc. features")
    return df
# Apply the same cleaning / imputation / scaling / PCA pipeline to the
# customers data, then assign each customer to a cluster
customers_processed = clean_data(customers)
customers_processed.drop(['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'], axis=1, inplace=True)
customers_imp = imp.transform(customers_processed)
customers_scaled = scaler.transform(customers_imp)
customers_pca = pca.transform(customers_scaled)
customers_clusters = final_model.predict(customers_pca)
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
prop_gen_pop = []
prop_cust_pop = []
counts = np.unique(customers_clusters, return_counts=True)[1]
for i in range(10):
prop_cust_pop.append(counts[i]/len(customers_clusters))
counts = np.unique(gen_population_predict, return_counts=True)[1]
for i in range(10):
prop_gen_pop.append(counts[i]/len(gen_population_predict))
temp_df = {'cluster': list(range(10)), 'prop_cust': prop_cust_pop, 'prop_gen': prop_gen_pop}
prop_df = pd.DataFrame(temp_df)
prop_df.plot(x='cluster', y = ['prop_cust', 'prop_gen'], kind='bar')
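One caveat with the counting loops above: `np.unique(..., return_counts=True)` drops clusters that received no members, so `counts[i]` would silently misalign if any cluster were empty. A hedged alternative sketch using `np.bincount(minlength=k)`, which keeps zero-count clusters in place (the tiny label arrays below are illustrative stand-ins for `customers_clusters` and `gen_population_predict`):

```python
import numpy as np

k = 10
# Toy stand-ins for customers_clusters / gen_population_predict
labels_cust = np.array([0, 0, 1, 3, 3, 3, 9])
labels_gen = np.array([0, 1, 1, 2, 9, 9])

# bincount(minlength=k) returns a length-k vector, with 0 for empty clusters
prop_cust = np.bincount(labels_cust, minlength=k) / len(labels_cust)
prop_gen = np.bincount(labels_gen, minlength=k) / len(labels_gen)
```

The two proportion vectors always share the same cluster indexing, so they can be compared element-wise without further bookkeeping.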
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population? Cluster 0 stood out above,
# so project its members back from PCA space to (approximate) original features.
over_rep_inv_pca = pca.inverse_transform(customers_pca[np.where(customers_clusters == 0)])
over_rep = scaler.inverse_transform(over_rep_inv_pca).round()
over_rep_df = pd.DataFrame(data=over_rep, columns=customers_processed.columns)
over_rep_df = over_rep_df.applymap(lambda x: x if x > 0 else 0)  # clip negatives from the inverse transform
over_rep_df.mean().sort_values()
# Same back-projection for the general population outside cluster 0
gen_pop_clust_inv_pca = pca.inverse_transform(X_pca[np.where(gen_population_predict != 0)])
gen_pop_clust = scaler.inverse_transform(gen_pop_clust_inv_pca).round()
gen_pop_clust_df = pd.DataFrame(data=gen_pop_clust, columns=customers_processed.columns)
gen_pop_clust_df = gen_pop_clust_df.applymap(lambda x: x if x > 0 else 0)
# Features with the largest mean difference between the two groups
sort_diff = (over_rep_df.mean() - gen_pop_clust_df.mean()).sort_values()
sort_diff
# Take entries 1..39 of the largest absolute mean differences (the very
# largest entry, which dwarfs the rest on this scale, is skipped)
ind = np.abs(sort_diff).sort_values(ascending=False)[1:40].index
over_rep_top_10 = over_rep_df[ind].mean().round()
gen_pop_clust_top_10 = gen_pop_clust_df[ind].mean().round()
compare_dict = {'index': list(ind), 'over_rep_top_10': over_rep_top_10, 'gen_pop_clust_top_10': gen_pop_clust_top_10}
compare_df = pd.DataFrame(compare_dict)
compare_df.plot(x='index', y = ['over_rep_top_10', 'gen_pop_clust_top_10'], kind='bar',figsize=(12,6))
over_rep_df.ANZ_HAUSHALTE_AKTIV.mean()
gen_pop_clust_df.ANZ_HAUSHALTE_AKTIV.mean()
gen_pop_clust_df.ANZ_HAUSHALTE_AKTIV.mean()-over_rep_df.ANZ_HAUSHALTE_AKTIV.mean()
new_df = pd.concat([gen_pop_clust_df.mean(), over_rep_df.mean()], axis=1)
# Avoid division by zero in the percent-change comparison below
new_df_2 = new_df.applymap(lambda x: x if x != 0 else x + 0.001)
(0.003489-0.189585)/0.189585
new_df.loc['CAMEO_DEU_2015_9B',:]
# Percent change of group means (raw; zero baselines produce inf)
sort_diff2 = new_df.pct_change(axis='columns')[1].sort_values()
ind = np.abs(sort_diff2).sort_values(ascending=False)[1:40].index
over_rep_top_10 = over_rep_df[ind].mean().round()
gen_pop_clust_top_10 = gen_pop_clust_df[ind].mean().round()
compare_dict = {'index': list(ind), 'over_rep_top_10': over_rep_top_10, 'gen_pop_clust_top_10': gen_pop_clust_top_10}
compare_df = pd.DataFrame(compare_dict)
compare_df.plot(x='index', y = ['over_rep_top_10', 'gen_pop_clust_top_10'], kind='bar', figsize=(12,6))
# Repeat with the zero-adjusted means, avoiding inf from zero baselines
sort_diff3 = new_df_2.pct_change(axis='columns')[1].sort_values()
ind = np.abs(sort_diff3).sort_values(ascending=False)[1:40].index
over_rep_top_10 = over_rep_df[ind].mean().round()
gen_pop_clust_top_10 = gen_pop_clust_df[ind].mean().round()
compare_dict = {'index': list(ind), 'over_rep_top_10': over_rep_top_10, 'gen_pop_clust_top_10': gen_pop_clust_top_10}
compare_df = pd.DataFrame(compare_dict)
compare_df.plot(x='index', y = ['over_rep_top_10', 'gen_pop_clust_top_10'], kind='bar', figsize=(12,6))
Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use each individual's demographic information to decide whether it is worth including that person in the campaign.
The "MAILOUT" data has been split into two approximately equal parts, each with almost 43,000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
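Before modeling, it is worth quantifying how imbalanced `RESPONSE` is, since that drives the choice of evaluation metric later. A minimal sketch; the toy frame below is an illustrative stand-in for `mailout_train`, whose real `RESPONSE` column is heavily skewed toward non-responders:

```python
import pandas as pd

# Toy stand-in for mailout_train: RESPONSE is heavily skewed toward 0
demo = pd.DataFrame({"RESPONSE": [0] * 98 + [1] * 2})

rate = demo["RESPONSE"].mean()            # fraction of positive responders
counts = demo["RESPONSE"].value_counts()
print(f"response rate: {rate:.2%}")       # → response rate: 2.00%
```

Running `mailout_train['RESPONSE'].value_counts()` on the real data gives the actual positive/negative split.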
# mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
mailout_train = pd.read_csv('df_train.csv')
if 'Unnamed: 0' in mailout_train.columns:
mailout_train.drop(['Unnamed: 0'], axis=1, inplace= True)
mailout_train.head()
mailout_train_clean = clean_data(mailout_train)
features = [c for c in mailout_train_clean.columns if c not in ['RESPONSE']]
response = mailout_train_clean['RESPONSE']
training_imp = imp.transform(mailout_train_clean[features])
training_scaled = scaler.transform(training_imp)
training_scaled_df = pd.DataFrame(data= training_scaled, columns=mailout_train_clean[features].columns)
training_pca = pca.transform(training_scaled)
training_clusters = final_model.predict(training_pca)
mailout_train_clean['cluster'] = training_clusters
training_scaled_df['cluster'] = training_clusters
Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!
Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this need not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy is not an appropriate performance evaluation method. Instead, the competition will use AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: what matters is that higher values capture as many of the actual customers as possible, early in the ROC curve sweep.
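The metric point can be made concrete on a tiny imbalanced sample: a degenerate model that predicts "no response" for everyone looks strong on accuracy but carries no ranking information, which is exactly what AUC measures. An illustrative sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 98 non-responders, 2 responders - the same kind of imbalance as MAILOUT
y_true = np.array([0] * 98 + [1] * 2)

# Degenerate model: always predict "no response"
acc = accuracy_score(y_true, np.zeros(100))      # 0.98 - looks great, is useless
auc_flat = roc_auc_score(y_true, np.zeros(100))  # 0.5 - no ranking ability

# Scores that rank both responders above everyone else
scores = np.concatenate([np.full(98, 0.1), np.full(2, 0.9)])
auc_ranked = roc_auc_score(y_true, scores)       # 1.0 - perfect ranking
```

This is why the submission column only needs to be an ordering, not a calibrated probability.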
# mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
mailout_test = pd.read_csv('df_test.csv')
if 'Unnamed: 0' in mailout_test.columns:
mailout_test.drop(['Unnamed: 0'], axis=1, inplace= True)
mailout_test_clean = clean_data(mailout_test, no_del=False)
testing_imp = imp.transform(mailout_test_clean[features])
testing_scaled = scaler.transform(testing_imp)
testing_scaled_df = pd.DataFrame(data= testing_scaled, columns=mailout_test_clean[features].columns)
testing_pca = pca.transform(testing_scaled)
testing_clusters = final_model.predict(testing_pca)
mailout_test_clean['cluster'] = testing_clusters
testing_scaled_df['cluster'] = testing_clusters
# LightGBM parameters: small leaves, strong row/feature subsampling, and a low
# learning rate to regularize on the heavily imbalanced data; AUC as the metric
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average': 'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,
    'metric': 'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary',
    'verbosity': 1
}
features.append('cluster')
# random_state requires shuffle=True in StratifiedKFold
folds = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
oof = np.zeros(len(mailout_train_clean))
predictions = np.zeros(len(mailout_test))
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(mailout_train_clean.values, response.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(mailout_train_clean.iloc[trn_idx][features], label=response.iloc[trn_idx])
    val_data = lgb.Dataset(mailout_train_clean.iloc[val_idx][features], label=response.iloc[val_idx])
    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=1000, early_stopping_rounds=3000)
    # Out-of-fold predictions for an unbiased CV estimate
    oof[val_idx] = clf.predict(mailout_train_clean.iloc[val_idx][features], num_iteration=clf.best_iteration)
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # Average the test predictions over all folds
    predictions += clf.predict(mailout_test_clean[features], num_iteration=clf.best_iteration) / folds.n_splits
print("CV score: {:<8.5f}".format(roc_auc_score(response, oof)))
cols = (feature_importance_df[["Feature", "importance"]]
.groupby("Feature")
.mean()
.sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]
plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Feature importance (averaged over folds)')
plt.tight_layout()
# Attach the averaged predictions and write the Kaggle submission (LNR, RESPONSE)
mailout_test_clean['RESPONSE'] = predictions
mailout_test_clean[['LNR', 'RESPONSE']].to_csv('submission.csv', index=False)